A prognostic risk model based on DNA methylation levels of genes and lncRNAs in lung squamous cell carcinoma

Background Recurrence is a risk factor for the prognosis of lung squamous carcinoma (LUSC). DNA methylation levels of RNAs are also associated with LUSC prognosis. This study aimed to construct a prognostic model with high performance in predicting LUSC prognosis using the methylation levels of lncRNAs and genes. Methods The differentially expressed RNAs (DERs) and differentially methylated RNAs (DMRs) between the recurrent and non-recurrent LUSC tissues in The Cancer Genome Atlas (TCGA; training dataset) were identified. Weighted correlation network analysis was performed to identify co-methylation networks. Differentially methylated genes and lncRNAs with opposite expression-methylation levels were used for the screening of prognosis-associated RNAs. The prognostic model was constructed and its performance was validated in the GSE39279 dataset. Results A total of 664 DERs and 981 DMRs (including 972 genes) in recurrent LUSC tissues were identified. Three co-methylation modules, including 226 differentially methylated genes, were significantly associated with LUSC. Among prognosis-associated RNAs, 18 DERs/DMRs with opposite methylation-expression levels were included in the methylation prognostic risk model. LUSC patients with high risk scores had a poor prognosis compared with patients who had low risk scores (TCGA: HR = 3.856, 95% CI [2.297–6.471]; GSE39279: HR = 3.040, 95% CI [1.435–6.437]). This model had a high accuracy in predicting the prognosis (AUC = 0.903 and 0.800, respectively), equivalent to the nomogram model inclusive of clinical variables. Conclusions Referring to the methylation levels of the 16-RNAs might help to predict the survival outcomes in LUSC.


INTRODUCTION
Lung squamous carcinoma or lung squamous cell carcinoma (LUSC) is a histologic type of lung cancer that ranks first in the rate of incidence and mortality (Bray et al., 2018). The number of newly diagnosed cases and cancer-related death of lung cancer in 2018 was approximately 2.1 million and 1.8 million, respectively (Bray et al., 2018). Non-small cell
The methylation microarray dataset GSE39279 was downloaded from the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/geo/). It was selected using the following criteria: (1) methylation profiles from patients with LUSC; (2) inclusive of clinical recurrence information; (3) ≥150 samples. GSE39279 (GPL13534, Illumina HumanMethylation450 BeadChip) contains 444 samples, including 43 samples with recurrence information. This dataset was used as the validation dataset.

RNA annotation and identification of RNAs differentially expressed and methylated in recurrent LUSC
The annotation of lncRNAs and mRNAs in the expression and methylation files was performed using the HUGO Gene Nomenclature Committee (HGNC; http://www. genenames.org/). Ensembl IDs were converted to official gene symbols. The differentially expressed RNAs (DER) and differentially methylated RNAs (DMR), including lncRNAs and genes, between the samples with and without recurrence were identified in the TCGA training dataset. The limma package (version 3.34.7; https://bioconductor.org/packages/ release/bioc/html/limma.html) in R was used for the identification of DERs and DMRs. To maximize the intersection of DERs and DMRs, we set p < 0.05, false discovery rate (FDR) < 0.05, and |log 2 (Fold change, FC)| > 0.263 as the criteria for significant difference. The expression profiles of RNAs with differential expression and methylation levels were presented using the bidirectional hierarchical clustering heatmap by pheatmap (version 1.0.8; https://cran.r-project.org/web/packages/pheatmap/index.html). Besides, RNAs with both differential methylation and expression levels were identified using the Venn diagram.

Weighted correlation network analysis (WGCNA) of genes
The WGCNA networks identify gene modules associated with disease status based on the expression/methylation profiles of genes. The WGCNA package (version 1.63; https:// cran.r-project.org/web/packages/WGCNA/index.html) in R was used to analyze the co-methylation networks for RNAs in the TCGA training dataset, irrespective of the expression and methylation level. The criteria for WGCNA module identification were: min size = 100, cutHeight = 0.995, p < 0.05, and enrichment fold > 1. Genes included in WGCNA modules were matched with the RNAs with differentially expression and methylation levels, and the overlapping items were used for further analysis.

Construction of methylation prognostic model
Genes and lncRNAs included in WGCNA co-methylation modules that had opposite levels of methylation and expression between the TCGA LUSC samples with and without recurrence were used as candidates for screening prognosis-associated RNAs. The univariate Cox regression analysis in the R Survival package (version 2.41-1; http:// bioconductor.org/packages/survivalr/) was used for the screening of RNAs associated with LUSC prognosis. The criterion was log-rank p value < 0.05. The optimal methylation combination was subsequently identified using the Cox-Proportional Hazards (Cox-PH) model (L1-penalized least absolute shrinkage and selection operator, LASSO) in the R penalized package (version 0.9-50, http://bioconductor.org/packages/penalized/). The methylation prognostic model was then established and the prognostic risk score of each individual was calculated using the following algorithm: Prognostic risk score = ∑coef DMRs × Methylation DMRs , where coef represents the coefficient (LASSO coef) of each gene identified by the Cox-PH model and Methylation notes the methylation level of the gene in each sample, respectively. For the Kaplan-Meier survival curve analysis (version 2.41-1), samples in the training and validation datasets were grouped into the high and low-risk groups according to the median prognostic risk score. Also, the receiver operating characteristic (ROC) curve analysis was performed using the R pROC (version 1.14.0; https://cran.r-project.org/web/packages/pROC/index.html) to evaluate the accuracy of using this methylation prognostic model in predicting LUSC prognosis.

Identification of clinical variables associated with LUSC prognosis
This study also identified clinical variables associated with the LUSC prognosis in the TCGA training dataset. The associations of prognostic risk score and clinical variables (including patient age, gender, pathologic TNM classification, tumor stage, and radiation/ targeted therapy) with LUSC prognosis were assessed using the univariate and multivariate Cox regression analysis in the R Survival package (version 2.41-1). Independent variables that were significantly associated with LUSC prognosis were selected using the threshold of log-rank p value < 0.05.

Prognostic nomogram and index for survival
Prognostic nomogram and index are widely applied for estimated survival probability among patients with cancers and other conditions (Barnholtz-Sloan et al., 2012;Gold et al., 2009). Nomogram was established using the R "RMS" package (Version 5.1-2; https://cran. r-project.org/web/packages/rms/index.html). The score of each variable was ascribed according to its weight in the nomogram, and the individualized 3-and 5-year survival probabilities were then predicted according to total points.

Functional annotation for lncRNAs in the methylation prognostic model
Importantly, the functional annotations of differentially expressed lncRNAs including in the methylation prognostic model were identified to investigate the biological themes associated with lncRNAs. The correlation between expression profiles of differentially expressed genes and lncRNAs in recurrent LUSC samples in the TCGA training dataset were calculated using the Pearson correlation coefficient (r). LncRNA-gene pairs with confident coexpression levels (r > 0.4 or r < -0.4) were retained and the lncRNA-mRNA regulatory network was constructed using the Cytoscape software (version 3.8.0; http://www.cytoscape.org/). Using the expression profiles of the lncRNA-associated genes in the TCGA training samples, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with lncRNAs were identified using the Gene Set Enrichment Analysis (GSEA) software (version 4.0.2; http://software.broadinstitute.org/gsea/index. jsp). The threshold for significant enrichment was a normal p value < 0.05.

Identification of DERs and DMRs
After annotation in the HGNC database, a total of 54 lncRNAs and 12,932 genes with available expression profiles were obtained in the TCGA dataset, including 664 DERs and 981 DMRs (including 972 genes and nine lncRNAs) between the LUSC samples with and without recurrence (Figs. 1A and 1B). The expression and methylation profiles of the DERs and DMRs in the LUSC samples are shown in Figs. 1C and 1D, respectively. Venn diagram showed that 155 DERs were differentially methylated in recurrent LUSC samples compared with samples without tumor recurrence ( Fig. 2; Table S1).

WGCNA for DERs with differentially methylation levels in LUSC samples
Prior to the WGCNA module analysis, the scale-free topology prerequisite soft-thresholding power was identified. It was 7 when the scale-free topology model fit R 2 = 0.9 for the first time (Fig. 3A). The mean gene connectivity = 1 when soft-thresholding power = 7 (Fig. 3B), conforming to the scale-free network property. Based on the methylation profiles of all RNAs, 12 WGCNA modules (including 105-1327 RNAs) were subsequently identified in the TCGA training dataset according to the criteria we set before (soft threshold power = 7, min size = 100, and cutHeight = 0.995; Fig. 3C; Table 1). The module-trait relationship heatmap showing the correlations of modules with clinical variables is shown in Fig. 4. Also, the numbers of differentially methylated genes and fold enrichment value of each module are shown in Table 1. Three WGCNA co-methylation modules (green, pink, and turquoise), including 226 differentially methylated genes, had enrichment folds of >1 and an enrichment p value of <0.05.

Identification of LUSC prognosis-associated RNAs
After matching the above 226 WGCNA module genes with the 155 RNAs in Fig. 2, 41 common genes were identified (Table S3). Among them, 28 genes had opposite methylation-expression levels in the TCGA LUSC samples (Table S3). Besides, six of the nine lncRNAs listed in Table S1 had opposite methylation-expression levels. Among the 34 candidate DERs (including 28 genes and six lncRNAs), 26 DERs (including four lncRNAs and 22 genes) were identified as prognosis-associated RNAs by univariate Cox regression analysis and methylation profiles (Table S4).

Accuracy of the methylation prognostic model in predicting prognosis in LUSC
To validate the predictive power of the methylation prognostic model in LUSC, samples in the TCGA training dataset (n = 293) and the GSE39279 validation dataset (n = 43) were separately grouped into the high and low-risk groups according to the median risk score of each sample. Kaplan-Meier survival curve analysis showed that there was a significant difference in the recurrence-free survival ratio between patients with high and low-risk scores in the TCGA training dataset (HR = 3.856, 95% CI [2.297-6.471], p = 2.648e -08;

Clinical variables associated with LUSC prognosis
We also identified the clinical variables associated with the prognosis of LUSC in the TCGA cohort.

Prognostic nomogram
The prognostic nomogram was constructed using the methylation prognostic model, pathologic stage (1-4), and radiation therapy (yes/no; Fig. 7A). The predicted 3-and 5-year survival probabilities based on the nomogram had high compliances with the actual   Pathways associated with the two lncRNAs in the methylation prognostic model A total of 320 lncRNA-mRNA pairs related to the two lncRNAs (DIRC3 and RMST, downregulation and hypermethylation) in the methylation prognostic model were extracted (r > 0.4 or r < -0.4; Table S5). Accordingly, the lncRNA-mRNA network included 320 interactions and 281 nodes (two lncRNAs and 279 differentially expressed genes; Fig. 8). Downregulated genes with hypermethylation levels, including LTF, ADH7, ST6GALNAC1, THNSL2, and WFDC10B, had positive correlations with RMST, genes including ABCA12, DGKA, FAM181B, and TRIM7 were positively correlated with DIRC3, LIMCH1 had negative correlation with both of RMST and DIRC3, and the other genes including SGCG, BNIPL, GNRH2, HORMAD2, NPHP3, and RTP1 had positive correlations with both of RMST and DIRC3 (Table S5). Based on the expression profiles of the 279 genes in the TCGA LUSC cohort, we identified that DIRC3 and RMST were associated with five and four KEGG pathways, respectively (Table 5). DIRC3 and RMST both were associated with the "Pathways in cancer" and "Endocytosis".  univariate Cox regression analysis in R package was identified in our study. The methylations of seven RNAs were associated with poor prognosis in LUSC patients by the univariate Cox regression analysis, but the Multivariate Cox PH model showed that they were associated with good prognosis in LUSC. However, further analyses showed that there were confident correlations among the two lncRNAs RMST and DIRC3 and the other 16 genes, which could be the cause of the above inconsistency. Also, the lncRNA-mRNA network showed that the genes might be regulated by the two lncRNAs. These results showed that the two lncRNAs RMST and DIRC3 might be confounding factors for the other genes.

DISCUSSION
In the methylation prognostic model, LIM and calponin-homology domains 1 (LIMCH1) is the only gene downregulated and positively correlated with the LUSC prognosis by univariate Cox regression analysis (HR = 7.96E-03). LIMCH1 regulates the activity of an important motor protein monmuscle myosin II and cell migration and growth in lung cancer cells (Lin et al., 2017;Zhang, Zhang & Xu, 2019). It partially restored the invasive phenotype of cancer cells (Bersini et al., 2020). However, Halle et al. (2021) showed that high LIMCH1 expression was significantly associated with poor survival  of cervical cancer (p = 0.004, HR = 3.17). Also, our study showed that high LIMCH1 methylation level contributed to a low risk score and was significantly associated with a good prognosis of LUSC. These results showed that the contribution of expression and methylation levels of LIMCH1 and other RNAs in the methylation prognostic model to LUSC prognosis should be validated in more cohorts. The HORMA Domain Containing 2 (HORMAD2) gene is a host gene of HIV proviruses and it has one provirus integration, which is essential for synapsis surveillance during host cell meiosis (Fennessey et al., 2017;Symons, Cameron & Lewin, 2018). Also, this increases the difficulty of virus removal and provides an implication for virus management (Symons, Cameron & Lewin, 2018). The tripartite motif-containing 7 (TRIM7) gene is a tumor suppressor gene that encodes glycogenin-interacting proteins and an E3 ubiquitin ligase, which suppresses the progression of hepatocellular carcinoma (Jin et al., 2020;Zhu et al., 2020). E3 ubiquitin ligases are involved in various cellular functions, including innate immunity and inflammation (Giraldo et al., 2020a). Loss of N6-Methyladenosine modification of TRIM7 promoter was observed in osteosarcoma tissues (Zhou et al., 2020). Evidence shows that E3 ubiquitin ligase positively regulates toll-like receptor 4 (TLR4)-mediated immune response in macrophages (Lu et al., 2019). Of special interest are recent reports showing their roles in virus replication (Giraldo et al., 2020a;Stukalov et al., 2020).
TRIM7 can either promote virus pathogenesis or protect against infection, depending on the context of virus infection (Giraldo et al., 2020a;Orchard et al., 2019). It promotes herpes virus and Zika virions infection (Giraldo et al., 2020b) but may play an antiviral role against norovirus (Orchard et al., 2019). TRIM7-mediated ubiquitination of the envelope protein of Zika virions promotes the release of virus from infected cells and binding of envelope protein to host cellular receptors, and enhances viral entry and replication (Orchard et al., 2019). A recent study by Stukalov et al. (2020) highlighted that the TRIM7 protein binding to the SARS-CoV-2 M phosphorylation site to drive ubiquitination. However, TRIM7's effect on SARS-CoV-2 replication is not clear now.
Our present study found that the DNA methylation levels of HORMAD2 and TRIM7 promoter were upregulated in recurrent LUSC tissues compared with non-recurrent tissues, and HORMAD2 and TRIM7 expression were downregulated in recurrent LUSC tissues. The deregulation and methylation levels of HORMAD2 and TRIM7 in recurrent LUSC tissues might provide references for exploring the potential association of LUSC prognosis with virus infection. LncRNAs rhabdomyosarcoma 2 associated transcript (RMST) and disrupted in renal carcinoma 3 (DIRC3) both, especially the former, are known as tumor suppressors (Coe et al., 2019;Kong, Liu & Kong, 2018;Liu et al., 2019;Wang et al., 2018). The depletion and loss of function mutation of the two lncRNAs eliminate the threat of malignant transformation, promote tumor cell proliferation, migration, and invasion (Coe et al., 2019;Liu et al., 2019;Wang et al., 2018). RMST could enhance tumor cell apoptosis, block the G0/G1 phase and cell proliferation, and restrain cell invasion and migration in triple-negative breast cancer (Wang et al., 2018). RMST (Peng et al., 2020) positively regulates the DNA methyltransferase 3B (DNMT3B) and negatively regulates DNMT3 expression (Peng et al., 2020). RMST knockout suppresses DNMT3 by promoting the binding of HuR protein to DNMT3B and enhancing DNMT3B stability (Peng et al., 2020). What's more, the deletion of Dnmt3a in mouse promotes lung tumor progression (Gao et al., 2011), suggesting the critical roles of RMST in regulating tumor progression and development.
Among the downregulated and hypermethylated genes associated with both RMST and DIRC3 (including BNIPL, HORMAD2, and NPHP3), HORMAD2 is the only gene related to lung cancer (Liu et al., 2012). Liu et al. (2012) showed that HORMAD2 mRNA expression was rarely expressed in the lung and HORMAD2 protein was detected more frequently in early-stage lung adenocarcinoma compared with advanced cancers. The nephronophthisis 3 (NPHP3) gene is associated with nephronophthisis and is necessary for the formation of primary cilia formation (Abdullah et al., 2017;Lee, Kim & Moon, 2019). BNIP-2-like (BNIPL) is an apoptosis-associated protein that interacts with cell proliferation-related proteins including BCL-2 and Cdc42GAP (Qin et al., 2003). However, our present study showed that these genes and lncRNAs were downregulated in recurrent LUSC specimens, along with increased methylation levels. Besides, the inclusion of them in the prognostic model had high performance in predicting the prognosis in LUSC patients, suggesting the novel and crucial roles of these genes in LUSC progression.

CONCLUSIONS
In summary, our study identified a prognostic model based on the methylation levels of lncRNAs and genes in LUSC patients with and without recurrence. This methylation had high performance in predicting the prognosis as well as the 3-year and 5-year survival probabilities in LUSC patients. Of special interest is that most of these RNAs had not been reported in lung cancers and two genes (HORMAD2 and TRIM7) might associate with virus infection. Referring to the methylation levels of these RNAs might help to predict the survival outcomes in LUSC.